Outline: Introduction | Summary of Data | Univariate Analysis | Bivariate Analysis | Multivariate Analysis | Regression Analysis | Summary
The goal of this project is to determine chemical properties that influence the quality of red wines. Since this is a project for data uni-, bi-, and multivariate data visualization in R, the crux of the analyses focuses on such data visualization and miltivariate analyses. The visualizations provide depictions on variable distributions and relations between variables, which lend to model building in the regression analyses.
I don’t drink, so, I can’t provide much personal opinions on what chemical properties affecting red wine’s quality. If this was my thesis, this section would be filled with literature reviews on wine preference and red wine. However, for this project, I’ll limit my literature review based on what I gathered from the paper that included the scoring methodology and dataset.
Different researchers have different preference on how the approach data but the end goal should be very similar, i.e. to answer the research question(s). My approach is to combine my programming structure with the analysis story, which for the most, should coincide.
Before reading and exploring data, I like to reserve the intro section of R programming to gather the needed packages and of course data reading process.
library(car); library(plyr); library(reshape2); library(data.table);
library(ggplot2); library(GGally); library(memisc); library(grid);
library(gridExtra); library(MASS); library(ordinal); library(tinytex)
############ Data Read ############
getwd() # Current Directory
## [1] "C:/Users/FA279J/Documents/Edu/DAND/rProject"
setwd("C:/Users/FA279J/Documents/Edu/DAND/rProject")
# list.files()
red1 <- read.csv("wineQualityReds.csv", header=TRUE, fill=TRUE)
Obtaining data information and statistics allows researchers to not only get introduced to the data but before that to check if the data are read correctly. From experience, it’s a good practice to observe the overall data outlook:
sapply(red1, class)
## num fixed.acidity volatile.acidity
## "integer" "numeric" "numeric"
## citric.acid residual.sugar chlorides
## "numeric" "numeric" "numeric"
## free.sulfur.dioxide total.sulfur.dioxide density
## "numeric" "numeric" "numeric"
## pH sulphates alcohol
## "numeric" "numeric" "numeric"
## quality
## "integer"
head(red1)
## num fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
My univariate analyses are based on each variable’s distribution based on summary() and histogram. From these analyses I observed:
############ Data Summary ############
cat("\n\nThe dataset's dimension is ",dim(red1),".\n")
##
##
## The dataset's dimension is 1599 13 .
cat("\n\nUnivariate Statistics of Variables:-\n")
##
##
## Univariate Statistics of Variables:-
summary(red1)
## num fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
p01 <- ggplot(red1, aes(x=fixed.acidity)) +
geom_histogram(color = 'black',fill = I('pink')) +
xlab('Fixed Acidity, g/dm^3') +
geom_vline(aes(xintercept = mean(fixed.acidity)),col='red',size=0.5) +
geom_vline(aes(xintercept = median(fixed.acidity)), col = 'grey', size=1)
p02 <- ggplot(red1, aes(x=volatile.acidity)) +
geom_histogram(color = 'black',fill = I('pink')) +
xlab('Volatile Acidity, g/dm^3') +
geom_vline(aes(xintercept = mean(volatile.acidity)),col='red',size=0.5) +
geom_vline(aes(xintercept = median(volatile.acidity)), col = 'grey', size=1)
p03 <- ggplot(red1, aes(x=citric.acid)) +
geom_histogram(color = 'black',fill = I('pink')) +
xlab('Citric Acid, g/dm^3') +
geom_vline(aes(xintercept = mean(citric.acid)),col='red',size=0.5) +
geom_vline(aes(xintercept = median(citric.acid)), col = 'grey', size=1)
p04 <- ggplot(red1, aes(x=residual.sugar)) +
geom_histogram(color = 'black',fill = I('pink')) +
xlab('Residual Sugar, g/dm^3') +
geom_vline(aes(xintercept = mean(residual.sugar)),col='red',size=0.5) +
geom_vline(aes(xintercept = median(residual.sugar)), col = 'grey', size=1)
p05 <- ggplot(red1, aes(x=chlorides)) +
geom_histogram(color = 'black',fill = I('pink')) +
xlab('Sodium Chloride, g/dm^3') +
geom_vline(aes(xintercept = mean(chlorides)),col='red',size=0.5) +
geom_vline(aes(xintercept = median(chlorides)), col = 'grey', size=1)
p06 <- ggplot(red1, aes(x=free.sulfur.dioxide)) +
geom_histogram(color = 'black',fill = I('pink')) +
xlab('Free Sulfur Dioxide, mg/dm^3') +
geom_vline(aes(xintercept = mean(free.sulfur.dioxide)),col='red',size=0.5) +
geom_vline(aes(xintercept = median(free.sulfur.dioxide)), col = 'grey', size=1)
p07 <- ggplot(red1, aes(x=total.sulfur.dioxide)) +
geom_histogram(color = 'black',fill = I('pink')) +
xlab('Total Sulfur Dioxide, mg/dm^3') +
geom_vline(aes(xintercept = mean(total.sulfur.dioxide)),col='red',size=0.5) +
geom_vline(aes(xintercept = median(total.sulfur.dioxide)),
col = 'grey', size=1)
p08 <- ggplot(red1, aes(x=density)) +
geom_histogram(color = 'black',fill = I('pink')) +
xlab('Density, g/cm^3') +
geom_vline(aes(xintercept = mean(density)),col='red',size=0.5) +
geom_vline(aes(xintercept = median(density)), col = 'grey', size=1)
p09 <- ggplot(red1, aes(x=pH)) +
geom_histogram(color = 'black',fill = I('pink')) +
xlab('pH') +
geom_vline(aes(xintercept = mean(pH)),col='red',size=0.5) +
geom_vline(aes(xintercept = median(pH)), col = 'grey', size=1)
p10 <- ggplot(red1, aes(x=sulphates)) +
geom_histogram(color = 'black',fill = I('pink')) +
xlab('Potassium Sulphate, g/dm3') +
geom_vline(aes(xintercept = mean(sulphates)),col='red',size=0.5) +
geom_vline(aes(xintercept = median(sulphates)), col = 'grey', size=1)
p11 <- ggplot(red1, aes(x=alcohol)) +
geom_histogram(color = 'black',fill = I('pink')) +
xlab('Alcohol, % by Volume') +
geom_vline(aes(xintercept = mean(alcohol)),col='red',size=0.5) +
geom_vline(aes(xintercept = median(alcohol)), col = 'grey', size=1)
p12 <- ggplot(red1, aes(x=quality)) +
geom_histogram(binwidth=1.0, color = 'maroon',fill = I('violet')) +
xlab('Quality, 0-10') +
scale_x_continuous(limits = c(3,9), breaks = c(4,5,6,7,8)) +
geom_vline(aes(xintercept = mean(quality)),col='red',size=0.5) +
geom_vline(aes(xintercept = median(quality)), col = 'grey', size=1)
grid.arrange(p01, p02, p03, p04, p05, p06, p07, p08, p09, p10, p11, p12, ncol=3)
Normality Assumption
Non-normally distributed variables may pose problems in the regression analyses. If these variables seriously deviate from a normal distribution, researchers should transform these variables or categorize them accordingly. Plots below represent examples used to test for variable normality. I’ve used it for the three sets of non-normal distribution listed above. The plottings of all the questionable take up space. So, I am sharing some plot examples.
In the examples below, we see that both chlorides and total.sulfur.dioxide seem more normal after a natural log transformation. With a ggplot, I may adjust the binwidth to obtain a better distribution depiction. I have a personal preference for a natural log transformation due to its meaningful insights compared to other transformation - although later on, I tested for non-linear relationship with squared variables. In a linear regression analysis, for instance, a natural-log transformed independent variable’s parameter estimate may be interpreted as “a one-percent change in the independent variable is related to a
Below are two examples on how I checked for variable’s normality distribution. I’d change the variables accordingly.
### Examples
# 1. Density Plot
d01 <- ggplot(red1,aes(x=chlorides)) + geom_density() +
ggtitle('Density Plot: Chlorides')
d02 <- ggplot(red1,aes(x=chlorides)) + geom_density() + scale_x_log10() +
ggtitle('Density Plot: Chlorides, Log Scale')
grid.arrange(d01,d02,ncol=1)
### Examples
# 2. Histogram
h01 <- ggplot(red1, aes(x=total.sulfur.dioxide)) +
geom_histogram(color='red', fill='cyan') +
ggtitle('Histogram: Total Sulfur Dioxide') +
geom_vline(aes(xintercept = mean(total.sulfur.dioxide)),
col='red',size=0.5) +
geom_vline(aes(xintercept = median(total.sulfur.dioxide)),
col = 'grey', size=1) +
annotate("text", x = mean(red1$total.sulfur.dioxide) * 1.5, y = 150,
label = paste0("Avg: ", round(mean(red1$total.sulfur.dioxide),1))) +
annotate("text", x = median(red1$total.sulfur.dioxide) * 1.1, y = 200,
label = paste0("Med: ", round(median(red1$total.sulfur.dioxide),1)))
h02 <- ggplot(red1, aes(x=total.sulfur.dioxide)) +
geom_histogram(color='red', fill='pink') +
scale_x_log10()+
ggtitle('Histogram: Total Sulfur Dioxide, Log Scale') +
geom_vline(aes(xintercept = mean(total.sulfur.dioxide)),
col='red',size=0.5) +
geom_vline(aes(xintercept = median(total.sulfur.dioxide)),
col = 'grey', size=1) +
annotate("text", x = mean(red1$total.sulfur.dioxide) * 1.2, y = 100,
label = paste0("Avg: ", round(mean(red1$total.sulfur.dioxide),1))) +
annotate("text", x = median(red1$total.sulfur.dioxide) * 0.9, y = 125,
label = paste0("Med: ", round(median(red1$total.sulfur.dioxide),1)))
grid.arrange(h01, h02, ncol=1)
Variable Transformation
Besides a natural log transformation, I considered categorizing continuous variables and squared categorization. If my knowledge on these chemical properties were solid, I’d group them according to these meaningful categorization. I considered a binary, three-group, and four-group categorizations based on variable distribution and regression analysis results. The resulting categorization should not be too small percentage-wise that it may cause regression estimation issues.
### Categorization
# Quality, fixed.acidity, volatile.acidity, chlorides
red1$quality.f <- cut(red1$quality, breaks = c(0,4,5,6,10),
labels=c("4","5","6","7"))
red1$quality.3 <- cut(red1$quality, breaks = c(0,5,6,10),
labels=c("5","6","7"))
# Residual Sugar and Chlorides Transformations
red1$residual.sugar3 <- cut(red1$residual.sugar, breaks=c(-Inf, 2.2, 3, Inf),
labels=c("low","middle","high"))
red1$ln.residual.sugar <- log(red1$residual.sugar)
red1$ln.chlorides <- log(red1$chlorides)
red1$chlorides4 <- cut(red1$chlorides, breaks=c(-Inf, 0.07,0.08, 0.1, Inf),
labels=c("Q1","Q2","Q3","Q4"))
red1$chlorides2 <- cut(red1$chlorides, breaks=c(-Inf, 0.1, Inf),
labels=c("low","high"))
### Other Transformations
red1$ln.fixed.acidity <- log(red1$fixed.acidity)
red1$ln.volatile.acidity <- log(red1$volatile.acidity)
red1$ln.sulphates <- log(red1$sulphates)
red1$ln.total.sulfur.dioxide <- log(red1$total.sulfur.dioxide)
# non-linear
red1$residual.sugar.sq <- (red1$residual.sugar)**2
red1$alcohol.sq <- (red1$alcohol)**2
Quality Categorized
The categorized quality variable resulted in three categories: low (5), medium (6), and high(7). The groups corresponding size (percentage) are 744 (46.5%), 638 (39.9%), and 217 (13.6%). The high group may pose a concern for some categorical analysis, but it should be large enough for most analysis.
ggplot(red1, aes(x=quality.3)) +
geom_bar(color='blue', fill='cyan', stat="count") +
ggtitle('Histogram: Quality Categorized') +
geom_text(aes(label = ..count.., y= ..prop..), stat= "count", vjust = -.5) +
theme_light()
One can see that alcohol is strongly correlated to quality with fixed.acidity, volatile.acidity and sulphates having weak correlations to quality. These correlations may in the end tell which chemical properties are important. Though, it’s important note that these are untransformed variables and the strength of variables may be strengthen or dampen when other variables are considered in a model.
ggpairs(red1[c("fixed.acidity","volatile.acidity","citric.acid",
"residual.sugar","chlorides","quality")],
lower = list(continuous = wrap("points", shape = I('.'))),
upper = list(combo = wrap("box", outlier.shape = I('.'))))
ggpairs(red1[c("free.sulfur.dioxide","total.sulfur.dioxide","density",
"pH","sulphates","alcohol", "quality")],
lower = list(continuous = wrap("points", shape = I('.'))),
upper = list(combo = wrap("box", outlier.shape = I('.'))))
The correlation matrix is great at showing an overall picture. However, some relations or charts may need some detailed inspections, which may lead to data transformation. From the correlation matrix above, we can see that:
ggcorr(red1
#method = c("all.obs","spearman"), nbreaks = 4, label = TRUE
#name = "spearman r")+
, nbreaks = 4, label = TRUE,
hjust=0.8, angle=-70,size=3) +
ggtitle("Correlation Matrix")
## Warning in ggcorr(red1, nbreaks = 4, label = TRUE, hjust = 0.8, angle
## = -70, : data in column(s) 'quality.f', 'quality.3', 'residual.sugar3',
## 'chlorides4', 'chlorides2' are not numeric and were ignored
Non-Linear Relations
I looked into variables that may be related to quality in a non-linear fashion. My hypothesis is that properties that are very strong or even very week may not produce high quality red wine. Thus, I expected an inverse-U relationship for acid-like variables with quality or the relationship may plateau after a certain point. I used the plot below to test of individual variable’s relationship to quality.
The main interesting non-linear finding regards to quality-alcohol relationship. It seems that there’s a strong positive relationship for alcohol concentration between 10 and 12 percent. Outside this alcohol range, the relationship plateaus.
ggplot(red1, aes(alcohol, quality)) +
geom_point() +
geom_smooth() +
ggtitle('Evicence for Non-Linear Relations?')
## `geom_smooth()` using method = 'gam'
Boxplots
Boxplots below show many similar bivariate relationships as the ones obtained from correlation matrix. With respect to quality:
Since quality can be analyzed as a categorical variable, I also compared the quality and other variables treating quality as a categorical variable. Due to smaller proportions for qualities of ‘4 or less’ and ‘8 or more’, I grouped them together with ‘5’ and ‘7’ respectively - what I coined as ‘quality.3’. Hopefully, a bivariate analysis between quality.3 and other variables will provide more hints of additional important variables.
After a few iterations with the stratified boxplot and distribution -as depicted in the bivariate analysis above-, it seems that ln.sulphates definitely needs to be considered based on the boxplot. Other variables that I will look into include total.sulfur.dioxide, chlorides2, and pH. I may consider more variables but I have to keep in mind that some of these variables are highly correlated and maybe related to other variables. For instance, we expect free.sulfur.dioxide to be a component of total.sulfur.dioxide - they actually have a correlation of 66%. So, I would be wary of including both variables in the model due to multicollinearity, which should be reflected in the regression statistics when comparing models with one of the two variables vs. both variables.
Other relationships may not be obvious such as alcohol and pH. Alcohol tends to be neutral, thus, having a pH value around 7. The data depicts a low 21% correlation between these two variable. Two important correlations here are:
Since acid have lower pH value, I expected a strong negative correlation of acidic substance with the pH value, as shown by that of fixed.acidity. However, volatile.acidity and pH shows a positive but weak correlation.
I am a big fan of using the matrix to look into overall trends and variable association. So, for the multivariate statistics, I am doing the same. I employed the same matrix plot but this time, I added the three-level categorical variable as the color in the correlation matrix.
My takeaways on what variables to test for the regression analyses were very similar to the ones obtained in the univariate and bivariate sections. Based on the chart below:
ggplot(aes(y = quality, x = alcohol, color = total.sulfur.dioxide),
data = red1) +
scale_color_gradient(low="yellow", high="darkblue") +
geom_point(aes(size=volatile.acidity), alpha = 0.75,
position = 'jitter', shape=1) +
geom_smooth(method = lm) +
ggtitle('Quality x Alcohol,
\nControlling for Total Sulfur Dioxide and Volatile Acidity') +
labs(y='Quality', x='Alcohol, % by Volume') +
theme_light()
The section on quality vs. chlorides from the multivariate matrix plot above was too small. So, I created a stratified box plot below, hoping to gain more insights on the relationship of pH and chlorides to red wine quality. It’s still hard to gauge if there’s such a relationship. High chloride wines tend to have slightly lower quality.
ggplot(red1, aes(x=chlorides2, y=pH, fill=chlorides2)) +
geom_boxplot() +
labs(title="pH Distribution, \nBy Quality and Chlorides",
x='Sodium Chloride, g/dm^3', y='pH') +
facet_wrap(~quality.3)
I also looked into how citric.acid and density may be related to quality. Based on the chart below, the relationships to quality are spurious. There are no noticeable trends even when the analyses were stratified by citric.acid and color-coded by density. The only thing that popped up was a positive quality-citric.acid relationship for high citric.acid group.
ggplot(red1, aes(citric.acid, quality, colour = density)) +
scale_color_gradient(low="cyan", high="darkblue") +
geom_jitter(alpha = .5) +
facet_grid(chlorides2 ~ residual.sugar3, margins = TRUE) +
labs(title="Quality vs. Citric Acid, \nBy Quality and Citric Acid Groups",
x='Citric Acid, g/dm^3', y='Quality') +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
ggplot(aes(y = quality, x = alcohol, color = chlorides4), data = red1) +
geom_point(aes(size=pH), alpha = 0.75, position = 'jitter', shape=1) +
scale_color_brewer(type = 'qual',
guide = guide_legend(title = 'chlorides4',
reverse = F,
override.aes = list(alpha = 1,
size = 2))) +
geom_smooth(aes(group=chlorides4), method = lm) +
ggtitle('Quality x Alcohol, by Chlorides\nControlling for pH') +
labs(x='Alcohol, % by Volume', y='Quality') +
theme_light()
Using the template below, I looked into other variables but none of them seem to be strong.
ggplot(aes(y = quality, x = sulphates, color = total.sulfur.dioxide),
data = red1) +
geom_point(aes(size=volatile.acidity), alpha = 0.75, position = 'jitter') +
ggtitle('Quality x Alcohol, by Chlorides\nControlling for pH') +
labs(x='Potassium Sulphate, g/dm3', y='Quality') +
theme_light()
In the bivariate and multivariate analyses, I’ve identified variables that are likely to be influencing red wine quality. As previously mentioned, based on correlation results, I expect alcohol (r=0.48) and volatile.acidity (-0.39) to be important variables.
I ended up with a linear regression with six significant regressors: alcohol, volatile.acidity, ln.sulphates, total.sulfur.dioxide, and chlorides2 + pH. The overall model was highly significant based on the F test. The R-squared for this final model was 36.7%.
m1 <- lm(quality ~ alcohol, data = red1)
m2 <- update(m1, ~ . + volatile.acidity)
m3 <- update(m2, ~ . + ln.sulphates)
m4 <- update(m3, ~ . + total.sulfur.dioxide)
m5 <- update(m4, ~ . + chlorides2)
m6 <- update(m5, ~ . + pH)
mtable(m1, m2, m3, m4, m5, m6)
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = red1)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = red1)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + ln.sulphates,
## data = red1)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + ln.sulphates +
## total.sulfur.dioxide, data = red1)
## m5: lm(formula = quality ~ alcohol + volatile.acidity + ln.sulphates +
## total.sulfur.dioxide + chlorides2, data = red1)
## m6: lm(formula = quality ~ alcohol + volatile.acidity + ln.sulphates +
## total.sulfur.dioxide + chlorides2 + pH, data = red1)
##
## ============================================================================================================
## m1 m2 m3 m4 m5 m6
## ------------------------------------------------------------------------------------------------------------
## (Intercept) 1.875*** 3.095*** 3.369*** 3.612*** 3.722*** 4.962***
## (0.175) (0.184) (0.184) (0.191) (0.192) (0.373)
## alcohol 0.361*** 0.314*** 0.303*** 0.290*** 0.283*** 0.299***
## (0.017) (0.016) (0.016) (0.016) (0.016) (0.016)
## volatile.acidity -1.384*** -1.156*** -1.134*** -1.082*** -0.975***
## (0.095) (0.097) (0.097) (0.097) (0.101)
## ln.sulphates 0.641*** 0.660*** 0.747*** 0.729***
## (0.077) (0.077) (0.079) (0.079)
## total.sulfur.dioxide -0.002*** -0.002*** -0.002***
## (0.001) (0.001) (0.001)
## chlorides2: high/low -0.208*** -0.245***
## (0.049) (0.049)
## pH -0.442***
## (0.114)
## ------------------------------------------------------------------------------------------------------------
## R-squared 0.227 0.317 0.345 0.353 0.361 0.366
## adj. R-squared 0.226 0.316 0.344 0.352 0.359 0.364
## sigma 0.710 0.668 0.654 0.650 0.647 0.644
## F 468.267 370.379 280.646 217.574 179.608 153.472
## p 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -1721.057 -1621.814 -1587.752 -1578.324 -1569.192 -1561.728
## Deviance 805.870 711.796 682.108 674.111 666.456 660.263
## AIC 3448.114 3251.628 3185.503 3168.648 3152.385 3139.456
## BIC 3464.245 3273.136 3212.389 3200.911 3190.025 3182.473
## N 1599 1599 1599 1599 1599 1599
## ============================================================================================================
When the regressors were ran in an ordinal logistic regression modal, all the regressors were also significant. Highly significant in the linear regression model, pH is almost significant at 1 percent significant level.
ol <- clm(quality.3 ~ alcohol + volatile.acidity + ln.sulphates +
total.sulfur.dioxide + chlorides2 + pH,
data=red1)
summary(ol)
## formula:
## quality.3 ~ alcohol + volatile.acidity + ln.sulphates + total.sulfur.dioxide + chlorides2 + pH
## data: red1
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 1599 -1219.49 2454.98 6(0) 9.85e-12 3.0e+06
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## alcohol 0.946801 0.058960 16.058 < 2e-16 ***
## volatile.acidity -2.722549 0.352178 -7.731 1.07e-14 ***
## ln.sulphates 2.521606 0.265606 9.494 < 2e-16 ***
## total.sulfur.dioxide -0.012909 0.001904 -6.781 1.19e-11 ***
## chlorides2high -0.821801 0.172255 -4.771 1.83e-06 ***
## pH -0.972504 0.379180 -2.565 0.0103 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## 5|6 3.098 1.221 2.537
## 6|7 6.007 1.230 4.883
cat("\n\nOdds Ratio \n")
##
##
## Odds Ratio
exp(coef(ol))
## 5|6 6|7 alcohol
## 22.15944000 406.43443193 2.57745243
## volatile.acidity ln.sulphates total.sulfur.dioxide
## 0.06570705 12.44857820 0.98717358
## chlorides2high pH
## 0.43963900 0.37813505
cat("\n\nProportional Odds Test \n")
##
##
## Proportional Odds Test
nominal_test(ol)
## Tests of nominal effects
##
## formula: quality.3 ~ alcohol + volatile.acidity + ln.sulphates + total.sulfur.dioxide + chlorides2 + pH
## Df logLik AIC LRT Pr(>Chi)
## <none> -1219.5 2455.0
## alcohol 1 -1219.0 2456.1 0.90611 0.34115
## volatile.acidity 1 -1218.5 2454.9 2.04794 0.15241
## ln.sulphates 1 -1218.0 2454.0 3.01188 0.08266 .
## total.sulfur.dioxide 1 -1219.5 2456.9 0.04466 0.83263
## chlorides2 1 -1219.3 2456.5 0.43857 0.50781
## pH
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Proportional odds ratio test insignificant results indicate that we can assume that the odds ratio between the three levels of quality are proportionate. This means that the current ordinal logistic regression can be used that I don’t have to resort to a multinomial logistic regression instead.
Data Outlook
Given that the dataset does not contain too many variables, I started by looking into a scatter plot matrix. Initially, I look into the diagonally-placed distribution plots. Then, I revisited this plot to get a quick picture on relations between variables. From the scatterplot matrix below, I can quickly see the followings:
One can see that alcohol is strongly correlated to quality with fixed.acidity, volatile.acidity and sulphates having weak correlations to quality. These correlations may in the end tell which chemical properties are important. Though, it’s important note that these are untransformed variables and the strength of variables may be strengthen or dampen when other variables are considered in a model.
ggpairs(red1[c("alcohol","volatile.acidity","ln.sulphates",
"total.sulfur.dioxide","chlorides4","pH","quality")],
diag = list(continuous = wrap("densityDiag",
color = "purple", alpha = 0.5)),
lower = list(continuous = wrap("points",
color = "red", shape = I('.'))),
upper = list(combo = wrap("box",
color = "darkblue", outlier.shape = I('.')))) +
ggtitle('Correlation Matrix\nSelected Variables') +
theme(plot.title = element_text(hjust = 0.5))
This correlation matrix in this report actually serves as an important bivariate analysis tool. For instance, I can immediate see chemical properties that influence the quality of red wine. In my multivariate analysis, I will immediately include alcohol (r=0.48) and volatile.acidity (-0.39) in the quality model from the get-go. Then, I need to consider total.sulfur.dioxide, and citric acid, including their transformed and categorized variables.
Quality as a Categorical Variable
Since quality can be analyzed as a categorical variable, I also compared the quality and other variables treating quality as a categorical variable. Due to smaller proportions for qualities of ‘4 or less’ and ‘8 or more’, I grouped them together with ‘5’ and ‘7’ respectively - what I coined as ‘quality.3’. Hopefully, a bivariate analysis between quality.3 and other variables will provide more hints of additional important variables.
ggpairs(data=red1,
columns=c("quality.3","alcohol","volatile.acidity","ln.sulphates"
,"total.sulfur.dioxide","chlorides2","pH"),
title="Bivariate Analysis",
aes(color = quality.3, alpha = .9),
axisLabels = 'internal') +
ggtitle('Quality Categorized,
\nBox Plots and Stratified Distribution') +
theme(plot.title = element_text(hjust = 0.5))
After a few iterations with the stratified boxplot and distribution -as depicted in the bivariate analysis above-, it seems that ln.sulphates definitely needs to be considered based on the boxplot. Other variables that I will look into include total.sulfur.dioxide, chlorides2, and pH. I may consider more variables but I have to keep in mind that some of these variables are highly correlated and maybe related to other variables. For instance, we expect free.sulfur.dioxide to be a component of total.sulfur.dioxide - they actually have a correlation of 66%. So, I would be wary of including both variables in the model due to multicollinearity, which should be reflected in the regression statistics when comparing models with one of the two variables vs. both variables.
Other relationships may not be obvious such as alcohol and pH. Alcohol tends to be neutral, thus, having a pH value around 7. The data depicts a low 21% correlation between these two variable. Two important correlations here are:
Since acid have lower pH value, I expected a strong negative correlation of acidic substance with the pH value, as shown by that of fixed.acidity. However, volatile.acidity and pH shows a positive but weak correlation. I would be concerned to introduce pH intovariable with fixed.acidity, but not one with volatile.acidity.
Multivariate Analysis Depiction
Two plot matrix above helped me to narrow down the variables that I should look into. I looked into variables with high correlation (ablsolute value) with quality. These variables are looked into as they are, transformed variables, and categorized variables. I am aware that sometimes a variable may not be highly correlated but may become significant when introduced into a model along with other variables.
In any case, a useful plot to depict various variables to quality can be represented in a chart like the one below. In this chart below: * alcohol and quality seems to be positively related based on the trend line; * volatile.acidity and quality seems to be negatively related based on lower quality tend to have larger circle points; and * total.sulfur.dioxide and quality may not be related because the the darker circles are scattered along different quality values.
I also tested several transformed variables in the model. One of my hypotheses is that properties that are very strong or even very week may not produce high quality red wine. Thus, I expected an inverse-U relationship for acid-like variables with quality.
ggplot(aes(y = quality, x = alcohol, color = total.sulfur.dioxide),
data = red1) +
scale_color_gradient(low="yellow", high="darkblue") +
geom_point(aes(size=volatile.acidity), alpha = 0.75,
position = 'jitter', shape=1) +
geom_smooth(method = lm) +
ggtitle('Quality vs. Alcohol,
\nControlling for Total Sulfur Dioxide and Volatile Acidity') +
theme(plot.title = element_text(hjust = 0.5)) +
labs(x='Alcohol, % by Volume', y='Quality') +
theme_light()
Conclusion from Regression Analyses
Both Linear and Logistic Regression analyses provided similar results. Variables that were positively related to quality were alcohol and ln.sulphates. Volatile.acidity, total.sulfur.dioxide, chlorides, and pH were negatively associated with quality.
Linear regression results indicates, given the other variables are held constant in the model:
a one-percent increase in sulphates (potassium sulphate - g / dm3) is associated to an increase in red wine quality by as score of 0.73;
a one-unit increase in pH score is related to a decrease in red wine quality by almost a half score.
Logistic regression results indicates, given the other variables are held constant in the model:
for a 10% g/dm3 increase in sulphates (potassium sulphate), the odds of higher quality versus next highest quality categories are 3.3 times greater (Note: 1.1^(12.44)=3.3; See: “Interpretation of log transformed predictors in logistic regression)”,
for a one-unit decrease in pH score, the odds of higher quality versus next highest quality categories are 2.6 times greater.
The uni-, bi-, and multivariate analyses really helped to focus on the important variables, best transformation, and appropriate transformation. I’m aware that there are statistics for many of these assumption tests such as test for normality. However, visual representations really helped to see which transformations and categorization cuts points would be the better ones.
Alcohol was the best predictor of quality, followed by acidity. The results truly make sense because alcohol is what makes wine an alcoholic beverage. As for acidity, I suppose people are attracted to the soda-like strength in a wine.
I was really hoping that the squared variables would be significant. To me, some chemical properties maybe associated to quality but after a point, the chemical properties may be too strong that the wine would be rated as low quality ones. Based on this results, one cannot recommend that we max out the chemical properties positively related to quality while minimize or eliminate those which are negatively related to quality. Since the red wines tested here tend to have average scores among expert, implications based on this study should only be generalized to average red wine.
Although squared variables were not significant, the hypothesis on extreme chemical properties being less desired is supported in chlorides findings. Chlorides actually had many small positive outliers. I found that extremely high chlorides tend to have lower quality compared to the first three quartiles. The variable was not significant when tested as a continuous and log-transformed variables. When included as quartiles, only the largest quartile was significant.
The validity of the study and findings may have been affected by the data quality. The methodology here plays a big part because I know that at least three testers were involved. If they tested several wines in a subsequent manner, their taste buds may be affected. And if they tested way too many wines at once, their judgments will certainly be impaired, resulting in invalid score.
We also do not have data on raters and if the testing were done in one session. Such data would allow researchers to control for the wine tester and testing session.
Most of the red wines were given average scores or 5, 6, and 7. We could have benefited from testing really bad and expensive wine or wine with extreme chemical properties for more variance in the score. Perhaps, a dataset with more variance can better explain the variance in quality and also capture a U-inverse relationship. In addition, a better quality scoring could incorporate various taste elements such as the five characteristics of wine such as sweetness, acidicity, tannin, fruit, and body.
Also stated in the data note, the “median of at least 3 evaluations made by wine experts” were taken. Taking the median score may be problematic due to the tendency to arrive at scores closer to “5”. If I had each evaluator’s valuation for each wine, I may be able to remove scores with high variance based on evaluation validity. Since we are generalizing the results to the general population, I am not convinced that sampling by experts would necessarily represent the larger population taste and preference.
As laid out in the simple data note, “there is no data about grape types, wine brand, wine selling price”. Grape types, wine brand, and wine selling price are arguably important factors in determining wine quality. One may argue that selling price may be a better reflection of wine quality although each wine’s supply factor may serve as a counter-argument.
A nice segway to my last course of this Nanodegree - the multinomial analyses in this study focused on regression ones when there are other Machine Learning tools that may better capture chemical properties that influence wine quality.
ggplot(aes(y = quality, x = alcohol, color = chlorides4),
data = subset(red1,alcohol>=9 & alcohol<=14)) +
geom_point(alpha = 0.75, position = 'jitter') +
scale_color_brewer(type = 'qual',
guide = guide_legend(title = 'chlorides4',
reverse = F,
override.aes = list(alpha = 1,
size = 2))) +
geom_smooth(aes(group=chlorides4), method = loess) +
ggtitle('Quality x Alcohol, Loess Trend Fitting') +
labs(x='Alcohol, % by Volume', y='Quality') +
theme_light()
From the chart above, one can see that quality may be affected by these variables in unique, non-systematic way which may require some algorithm. These imperfect association to quality were also depicted throughout this project. I look forward to learning various machine learning method in the Intro to Machine Learning course.
Changing Title in Multiplot ggplot2 Using grid.arrange: https://stackoverflow.com/questions/14726078/changing-title-in-multiplot-ggplot2-using-grid-arrange
Customizing RStudio: https://support.rstudio.com/hc/en-us/articles/200549016-Customizing-RStudio#editing
Data Transformation with dplyr: CHEAT SHEET: https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf
Draw a Trend Line Using ggplot: https://stackoverflow.com/questions/38412817/draw-a-trend-line-using-ggplot
ggplot2 Quick Reference: colour (and fill): http://sape.inf.usi.ch/quick-reference/ggplot2/colour
How Basic Wine Characteristics Help You Find Favorites: http://winefolly.com/review/wine-characteristics/
How to add mean, and mode to ggplot histogram?: https://stackoverflow.com/questions/47000494/how-to-add-mean-and-mode-to-ggplot-histogram
Interpretation of Log Transformed Predictors in Logistic Regression: https://stats.stackexchange.com/questions/8318/interpretation-of-log-transformed-predictors-in-logistic-regression
Ordinal Logistic Regression | R Data Analysis Examples: https://stats.idre.ucla.edu/r/dae/ordinal-logistic-regression/
R Markdown Cheat Sheet: https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf
RDocumentation: nominal_test: https://www.rdocumentation.org/packages/ordinal/versions/2015.6-28/topics/nominal_test
Recode Data in R: http://rprogramming.net/recode-data-in-r/
Wine Quality Methodology: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt